6056-6068 Nucleic Acids Research, 2011. Vol. 39, No. 14 
doi:10.1093lnarlgkr221 



Published online 14 April 2011 



Distinct patterns of somatic alterations in a 
lymphoblastoid and a tumor genome derived 
from the same individual 

Pedro A. F. Galante\ Raphael B. Parmigiani\ Qi Zhao^'^ Otavia L. Caballero^ 
Jorge E. de Souza\ Fabio C. P. Navarro\ Alexandra L. Gerber'^, Marisa F. Nicolas^, 
Anna Christina M. Salim\ Ana Paula M. Silva\ Lee Edsall^, Sylvie Devalle^ 
Luiz G. Almeida^ Zhen Ye^, Samantha Kuan^, Daniel G. Pinheiro^ Israel Tojal^ 
Renato G. Pedigoni^, Rodrigo G. M. A. de Sousa^, Thiago Y. K. Oliveira^, 
Marcelo G. de Paula^ Lucila Ohno-Machado^, Ewen F. Kirkness^, Samuel Levy^, 
Wilson A. da Silva Jr^, Ana Tereza R. Vasconcelos^ Bing Ren^, Marco Antonio Zago^ 
Robert L. Strausberg^^ Andrew J. G. Simpson^ Sandro J. de Souza' and 
Anamaria A. Camargo^'* 

\udwig Institute for Cancer Research, Sao Paulo Branch at Hospital Alemao Oswaldo Cruz, Sao Paulo, 
01323-903 SP, Brazil, ^J. Craig Venter Institute, Rockville, 20850 MD, USA, ^Ludwig Collaborative Group, 
Department of Neurosurgery, Johns Hopkins University, Baltimore, 20850, MD, USA, "^Laboratorio Nacional de 
Computagao Cientifica, Laboratorio de Bioinformatica, Petropolis, 25651-075 RJ, Brazil, ^Ludwig Institute for 
Cancer Research, San Diego Branch, San Diego, 92093 CA, USA, ^Departamento de Genetica, Faculdade de 
Medicina de Ribeirao Preto, Universidade de Sao Paulo, Ribeirao Preto, 14049-900 SP, Brazil, ''Division of 
Biomedical Informatics, University of California, San Diego, 92093 CA, USA and ^Departamento de Ch'nica 
Medica, Centro de Terapia Celular e Banco de Sangue, Universidade de Sao Paulo, Ribeirao Preto, 14051-140 
SP, Brazil 

Received November 28, 2010; Revised IVIarcli 22, 2011; Accepted IVlarcli 25, 2011 



ABSTRACT 

Although patterns of somatic alterations have been 
reported for tumor genomes, little is known on how 
they compare with alterations present in non-tumor 
genomes. A comparison of the two would be crucial 
to better characterize the genetic alterations driving 
tumorigenesis. We sequenced the genomes of a 
lymphoblastoid (HCC1954BL) and a breast tumor 
(HCC1954) cell line derived from the same patient 
and compared the somatic alterations present in 
both. The lymphoblastoid genome presents a com- 
parable number and similar spectrum of nucleotide 
substitutions to that found in the tumor genome. 
However, a significant difference in the ratio of 
non-synonymous to synonymous substitutions was 
observed between both genomes (P = 0.031). 
Protein-protein interaction analysis revealed that 
mutations in the tumor genome preferentially 



affect hub-genes (P = 0.001 7) and are co-selected 
to present synergistic functions (P< 0.0001). KEGG 
analysis showed that in the tumor genome most 
mutated genes were organized into signaling 
pathways related to tumorigenesis. No such 
organization or synergy was observed in the 
lymphoblastoid genome. Our results indicate that 
endogenous mutagens and replication errors can 
generate the overall number of mutations required 
to drive tumorigenesis and that it is the combination 
rather than the frequency of mutations that is crucial 
to complete tumorigenic transformation. 

INTRODUCTION 

Somatic genetic alterations accumulate in the genome of 
all dividing cells as a result of DNA replication errors or 
exposure to mutagens. Some somatic alterations, known 
as driver mutations, confer a selective growth advantage 
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to the cells and can cause cancer. Conversely, passenger 
alterations are biologically neutral and are expected to 
occur in normal genomes (1,2). Recent advances in 
sequencing technologies have allowed a comprehensive 
characterization of the genetic alterations occurring in in- 
dividual tumors, including copy number variations 
(CNVs), chromosomal rearrangements, point mutations 
and small insertions and deletions (Indels) (3-11). 
However, the frequency and characteristics of the 
genetic alterations driving tumorigenesis are not currently 
well-defined. 

Although the sequences of six-matched tumor and 
non-tumor genomes have been published (4-8,11), these 
studies were limited to the identification of somatic alter- 
ations present in the tumor genomes and there has been no 
attempt to define the somatic alterations present in the 
matched non-tumor genome. A comparison of the two 
would be crucial to better characterize the genetic 
changes that drive tumorigenesis, since we expect most, 
if not aU, alterations present in the non-tumor genome 
to be passenger mutations. 

Here, we have sequenced the genome of a breast tumor 
cell hne (HCC1954) and a lymphoblastoid cell line 
(HCC1954BL) derived from the same patient and 
compared the genetic variations occurring in both. 
HCC1954 is an immortal, pseudo-tetraploid (12), publicly 
available tumor cell line derived from a hormone receptor 
negative, ERBB2 positive, primary breast tumor from a 
61 -year-old female patient. HCC1954BL is an Epstein- 
Barr virus (EBV)-transformed lymphoblastoid ceU hne 
derived from the same patient. Both cell lines received 
similar treatments in terms of the timing of establishment 
and in vitro propagation (36 passages) (13) and the 
sequencing data revealed both hnes to be clonal, permitting 
the detection of somatic changes in both. 

Using a combined sequencing approach, we were able 
to characterize chromosomal rearrangements and single 
nucleotide variations (SNVs) in the protein-coding 
regions of the HCC1954 and HCC1954BL genomes. By 
comparing the sets of rearrangements and SNVs present in 
both genomes, we were able to exclude those that are 
common to both genomes and correspond to inherited 
variations and to identify those that are exclusive to 
each genome and likely correspond to somatic alterations 
that have independently accumulated in the genomes of 
these cell hnes during their hfespan. Thus, in the present 
work, in addition to the identification of somatic muta- 
tions present in the tumor genome, we have also identified 
those present in the matching lymphoblastoid genome and 
used a system biology approach to better characterize the 
functional differences between the set of altered genes 
present in both genomes. 

We found that the HCC1954BL genome contains very 
few somatically acquired chromosomal rearrangements 
but, surprisingly, present a comparable number of 
somatic point mutations and a similar spectrum of nucleo- 
tide substitutions to that found in the HCC1954 genome. 
We also observed that, unlike in the HCC1954BL genome, 
non-synonymous mutations present in the HCC1954 
genome were not randomly distributed and were 
co-selected to present synergistic tumor-promoting 



functions and to affect hub-genes in functional pathways 
related to tumorigenesis. Our results provide important 
insights into the normal mutational processes and into 
the functional implications of the accumulation of 
somatic mutations in a tumor and a matching 
lymphoblastoid genome. 

MATERIALS AND METHODS 

Cell lines and DNA extraction 

HCC1954 and HCC1954BL ceU lines were obtained from 
American Type Culture Collection (ATCC) and were 
maintained in RPMl medium containing 10% fetal 
bovine serum (FBS) and non-essential amino acids. Both 
cell lines received similar treatments in terms of the timing 
of estabhshment and in vitro propagation and present 
similar prohferation rates (data not shown). Both of 
them were received from ATCC at passage 25 after 
in vitro establishment and were maintained in culture 
until passage 36 when DNA was extracted for sequencing. 
DNA was isolated using the DNeasy Blood & Tissue Kit 
(Qiagen, Valencia, CA, USA). Genomic DNA was treated 
with RNase CocktaiF"^ (Ambion Austin, TX, USA), 
foUowed by phenol-chloroform extraction and precipita- 
tion of the aqueous phase in 1/10 volume 3M sodium 
acetate, pH5.2 and 100% ethanol. 

Public genome data 

The human reference genome sequence (NCBI build 
36.1/hgl8) was downloaded from the UCSC Genome 
Browser (http://genome.ucsc.edu). Alternative haplotype 
regions, including the inimunoglobuhn loci, were 
excluded from the reference sequence because of their 
highly polymorphic and rearranged structure. Human ref- 
erence mRNA sequences (RefSeqs for coding mRNAs, 
'NM' and non-coding mRNAs, 'NR') were downloaded 
from NCBI (http://www.ncbi.nlm.nih.gov/RefSeq) and 
mapped to the human genome as previously described 
(14). Known SNPs were also downloaded from the 
UCSC Genome Browser (dbSNP version 130) and 
loaded into a MySQL database. 

gDNA paired-end sequencing and mapping 

A total of 5 |.ig of genomic DNA was randomly sheared 
using a Biorupter according to the manufacturer's instruc- 
tions. The fragmented DNA was end-repaired using 
Klenow and T4 DNA polymerases and phosphorylated 
at the 5'-end with T4 polynucleotide kinase. A 3' 
overhang was created using 3'-5' exonuclease-deficient 
Klenow fragment and Illuniina paired-end adaptor oligo- 
nucleotides were ligated to the created sticky ends. DNA 
fragments of ~200bp were size selected in 8% polyacryl- 
amide gels and eluted from the gel overnight. Size-selected 
DNA was Polymerase Chain Reaction (PCR) amplified 
for 18 cycles to enrich for adapter-modified DNA frag- 
ments. A paired-end flow cell was prepared on the 
supphed cluster station according to the manufacturer's 
protocol. Clusters of PCR colonies were then sequenced 
on the lllumina GAII sequencing platform using the 
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recommended protocols. Images from the instrument were 
processed to generate sequence files in FASTAQ format. 
Sequences were aligned to the human genome reference 
sequence (NCBI build 36.1/hgl8) using Bowtie version 
0.12.2 (15), allowing two mismatches in mapping. 
Duplicated read pairs with identical coordinates were 
merged, and only unambiguously mapped reads were 
used for the structural variation analysis. 

Exome capture, sequencing and mapping 

A total of 34. 1 Mb of the human genome corresponding to 
approximately 180000 coding exons (28.4 Mb) and 
adjacent intergenic and intronic regions (5.7 Mb) was 
captured using the Nimblegen Sequence Capture 2.1 M 
Human Exome array vl.O. Briefly, genomic DNA 
samples were fragmented, hgated to adapters and 
hybridized to the exome capture array. Unbound frag- 
ments were washed away and the target-enriched DNA 
was eluted. Enriched samples were amplified by Ligation- 
Mediated PCR (LM-PCR). The adapters used during en- 
richment were designed for direct integration into the 
workflow of the 454 GS FLX instrument, eliminating the 
hbrary construction process. We continued the protocol 
from the DNA library quantification and emulsion PCR 
according to manufacturer's protocol. Sequencing was per- 
formed using the 454 GS FLX Titanium platform. 
Sequences were ahgned to the human genome reference 
sequence (NCBI build 36.1/hg 18) using BEAT program 
(parameters: noTrimA -tileSize = 12 -minScore = 200 
— minldentity = 92 -out = pslx) (16). Duplicated reads 
with identical coordinates were merged, and only unam- 
biguously mapped reads were used for SNV calling and 
mutation detection. 

Data deposition and availability 

Quality scores and FASTA sequences generated for 
HCC1954 and HCC1954BL were uploaded to the Short 
Read Archive under the accession numbers ERA010917 
and ERAO 11762 and are publicly available. 

Structural variation analysis 

lUumina paired-end reads that failed to ahgn to the 
human genome reference sequence in the expected orien- 
tation or distance were used in the structural variation 
analysis after removing those that mapped to highly re- 
petitive regions within 1 Mb of a centromeric or telomeric 
region. Reads for which the two ends aligned to within 
the expected distance, but with one of the two ends in the 
incorrect orientation were also excluded from the analysis 
because these reads are likely to be artifacts generated by 
mispriming of the Illumina sequencing ohgonucleotide or 
intramolecular rearrangements generated during library 
amplification. Structural variants were called from 
Bowtie ahgnnients of genomic paired-end sequences, 
requiring at least five independent read pairs in 
HCC1954 and no read pairs representing the rearrange- 
ment in HCC1954BE and vice versa. Interchromosomal 
rearrangements were called when each read from the same 
pair was mapped uniquely, but in distinct chromosomes. 
Intrachromosomal deletions were called when the 



read-pairs mapped in an expected order and orientation, 
but in a distance greater than the expected 
(average + 4* SD). Intrachromosomal tandem duphcations 
were called when read-pairs mapped in an expected orien- 
tation, but in an unexpected order and distance. 
Intrachromosomal inversions were called when read-pairs 
mapped in an expected order, but in an unexpected orien- 
tation and distance (for a review, see Ref. 17). Structural 
variants represented by read pairs that mapped to within 
500 bp of a previously identified copy number polymorph- 
ism were removed from the final list of somatic chromo- 
somal rearrangements. 

SNVs and point mutation analysis 

SNVs and somatic point mutations were independently 
called for each cell line from BEAT alignments of 
capture sequences to the human reference genome. SNVs 
supported by at least three reads with base quality >20 
were called for the HCC1954 and HCC1954BL genomes. 
To assess the sensitivity of our SNV detection strategy, we 
have compared our SNV calls to SNV calls extracted from 
genotyping array data available for both cell lines (GEO: 
GSE12019 and GSE13373), as previously described (5). 
SNVs common to both genomes and/or already described 
in dbSNP were excluded from the somatic point mutation 
analysis, since they likely correspond to inherited sequence 
variants. Somatic point mutations were then identified, 
requiring at least three high-quahty independent reads 
(>3 reads with Phred score >20 and >1 read with Phred 
score >30) reporting the variant in HCC1954 and no reads 
reporting the variant in HCC1954BE and vice versa. 
Variant reads were required to represent at least 20% of 
the total number of reads covering the variant genomic 
position to filter somatic mutations that might have even- 
tually arisen during in vitro propagation of both cell fines 
and present in a small subpopulation of the cells. A depth 
of at least 5 x was required to assure sufficient coverage in 
both cell fines. We also excluded from the point mutation 
analysis false mutation caUs residing in regions of 
loss-of-heterozygosity (LOH) in either of the ceU fines. 
Briefly, the zygosity of approximately 250000 known 
SNPs represented in the Affymetrix SNP array was 
determined for both cell fines using hybridization data 
available in pubfic databases (GEO: GSE12019 and 
GSE13373). LOH regions (where SNPs were heterozygous 
in the normal but homozygous in the tumor and vice 
versa) were identified using Hidden Markov Model algo- 
rithms as previously described (7,8). Results were 
manually inspected and exome sequence data were used 
to confirm the SNP array analysis and to increase reso- 
lution in regions of low-SNP density represented in the 
Affymetrix SNP array. 

Ratio of non-synonymous to synonymous substitutions 

Monte Carlo simulations were used to determine if the 
ratio of non-synonymous to synonymous substitutions 
(dN/dS) observed for HCC1954 and HCC1954BE were 
significantly different from that expected by chance (null 
hypothesis). In these simulations, values expected by 
change were obtained from 1000 random sets of 
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64 mutations for HCC1954 and 30 mutations for 
HCC1954BL occurring in the coding region of known 
human RefSeq genes. To calculate if the difference 
between the dN/dS ratios observed for HCC1954 and 
HCC1954BL was significant, we performed a test 
using the HCC1954BL dN/dS values as expected values 
for HCC1954. 

PCR and Sanger sequencing validation 

Primers for vahdation were designed to target regions im- 
mediately flanking point mutations using Primers3 
software (http://frodo.wi.mit.ed/primer3/). PCR amplifi- 
cation was performed on HCC1954 and HCC1954BL 
genomic DNA using Taq Platinum Hi-Fidelity 
Polymerase (Invitrogene), following standard protocols. 
PCR product purity and size were assessed on 2% 
agarose gels stained with GelRed (Biotium). Sanger 
sequencing was performed using an AB13130 Capillary 
DNA Analyzer. Sequence trace files were manually 
analyzed for point mutations. All confirmations were per- 
formed using both HCC1954 and HCC1954BL DNA to 
determine, if the variants were somatic or germhne. 
Vahdated non-synonymous mutations were classified as 
non-conservative if they result in changes in amino acid 
charge and polarity. Functional proteins domains were 
determined using Pfam and HMMER package 
(£■< 0.001). A Perl-script was used to cross information 
on the position of non-synonymous mutations and func- 
tional domains. Evolutionary conserved amino acid 
residues were obtained from UCSC Genome Browser 
Conservation Track (http://genome.ucsc.edu). Amino 
acid conservation between 10 different species. Pan trog- 
lodytes (Chimp), Macaca mulatto (Rhesus), Mus musculus 
(mouse), Rattus norvegicus (rat), Canis lupus familiaris 
(Dog), Loxodonta africana (elephant), Monodelphis 
domestica (Opossum), Gallus gallus (chicken) and 
Takifugu rubripes (fugu), was manually determined. 

KEGG and protein-protein interactions analysis 

The list of genes carrying non-synonymous mutations 
vahdated by Sanger sequencing for each ceU line was 
uploaded to the Kyoto Encyclopedia of Genes and 
Genomes (KEGG) database for the pathway enrichment 
analysis. Genes carrying non-synonymous mutations were 
used since these mutations are more likely to have a func- 
tional impact when compared to synonymous mutations. 
Monte Carlo simulations were performed in which 1000 
random sets of 45 genes and 12 genes were evaluated re- 
garding KEGG pathway enrichment to address whether 
the obtained results were significantly different {P < 0.05) 
from those expected by chance. Simulations were per- 
formed using all known coding genes and KEGG 
pathways (200 annotated pathways). For the protein- 
protein interactions (PPI) analysis, data from the follow- 
ing databases were merged: MINT (December 2007 
version), BIOGRID (2.0.37), INTACT (January 2008 
version), HPRD (September 2007 version), BIND (May 
2006 version) and DIP (January 2008 version). These 
databases include both high-throughput experiments and 
low-throughput ones curated from the literature. When 



a field was present for the technique used to discover 
the interaction, those registers with an entry that 
referred to one of the several mass spectrometry-based 
methods were excluded to avoid including indirect inter- 
actions. Exclusively functional interactions, for example, 
numerous exclusively genetic interactions present in 
BIOGRID, were also excluded. Only genes carrying 
non-synonymous mutations validated by Sanger 
sequencing were used in the PPI analysis. Monte Carlo 
simulations were performed in which 10000 random sets 
of 25 and 8 genes with known PPI were evaluated regard- 
ing the degree of interaction and the number of common 
interaction partners among mutated proteins to address 
whether the obtained results were significantly different 
{P < 0.05) from those expected by chance. 



RESULTS 

Sequencing strategy and genome coverage 

A combined sequencing strategy was used to characterize 
chromosomal rearrangements and point mutations in 
protein-coding regions of the HCC1954 and 
HCC1954BL genomes (Figure 1). Approximately 381 
million and 347 milhon 35-75 bp paired-end reads from 
200 bp DNA inserts were generated for HCC1954 and 
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Figure 1. Sequencing strategy. Outline of the sequencing strategy and 
bioinformatics algorithms used for the identification of point mutations 
and structural chromosomal rearrangements in the HCC1954 and 
HCC1954BL genomes. 
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Table 1. Summary of sequence generation and mapping to the reference human genome sequence for the HCC1954 and HCC1954BL cell lines 



HCC1954 HCC1954BL 





Capture sequencing 


Paired-end sequencing 


Capture sequencing 


Paired-end sequencing 


Total number of reads 


5 996 389 


381 274 888 


6265250 


347 891 568 


Mapped reads 


5212428 


254 326 859 


5106 763 


237 886 727 


Percentage of mapped reads 


86.9 


66.7 


81.5 


68.4 


Total number of nucleotides 


3 143 589 263 


19 392 752 128 


3 252428 887 


15 693 171 704 


Mapped nucleotides 


2 257 027 363 


13432965012 


2175 120 803 


11 166 288816 


Percentage of mapped nucleotides 


71.8 


69.3 


66.7 


71.1 



HCC1954BL, respectively, using an Illumina GAII 
sequencing platform (Figure 1 and Table 1). For 
HCC1954, ~254 million paired-end reads were unambigu- 
ously mapped to the reference genome, which based on the 
average insert size of ~200bp (Supplementary Figure SI) 
generated 8x physical coverage. Similar numbers were 
generated for HCC1954BL (~237 million mapped 
paired-end reads and 8x physical coverage). Paired-end 
reads that aligned discordantly with respect to each 
other on the reference genome were then used to detect 
chromosomal rearrangements in the HCC1954 and 
HCC1954BL genomes (Figure 1). 

In addition, the NimbleGen Sequence Capture Human 
Exome Array was used to capture 34.1 Mb of the human 
genome corresponding to approximately 180000 coding 
exons (28.4 Mb) and adjacent regions (5.7 Mb). 
Approximately, 2.3 Gb of unambiguously mapped se- 
quences were generated for each cell Hne and over 32% 
of the bases mapped to the targeted regions (Figure 1 and 
Table 1). The average fold-coverage was 22.8 for 
HCC1954 and 21.7 for HCC1954BL (Supplementary 
Figure S2). Captured sequences mapped to the reference 
genome were then used for the detection of SNVs and 
somatic point mutations in protein-coding exons and 
adjacent regions from both genomes (Figure 1). For 
HCC1954, 95.2% of targeted bases were covered at least 
once and 92% met our criteria for SNV calling. Similar 
numbers were obtained for HCC1954BL (95.7% of 
targeted bases covered at least once and 93.8% met our 
criteria for variant calUng). 

Chromosomal rearrangements 

We used a paired-end sequencing strategy to identify 
structural chromosome variations present in the 
HCC1954 and HCC1954BL genomes (18). Paired-end 
reads that aligned discordantly with respect to each 
other, on the reference human genome, were identified 
for HCC1954 (76407 paired-end reads) and 
HCC1954BL (55 967 paired-end reads). From this set of 
aligned reads, we excluded those that precisely duplicated 
other sequences derived from the same library, those that 
could not be unumbiguosly mapped to the human 
genome, and those that mapped within 1 Mb of a telomer- 
ic or centromeric sequence gap. Structural variants were 
called from the remaining paired-end sequences, requiring 
at least five independent read pairs exclusively reporting 
the variant in one of the cell lines. Structural variants 



Table 2. Somatic point mutations and structural variations in the 
HCC1954 and HCC1954BL genomes 



Somatic variations 


HCC1954 


HCC1954BL 




N (%) 


N (%) 


Point mutations 


274 (100) 


173 (100) 


Coding 


64 (23.36) 


30 (17.3) 


Nonsense 


2 (0.73) 


3 (1.7) 


Missense 


45 (16.42) 


15 (8.7) 


Synonymous 


17 (6.20) 


12 (6.9) 


Non-coding 


14 (5.11) 


15 (8.7) 


UTR 


13 (4.74) 


13 (7.5) 


ncRNA 


1 (0.36) 


2 (1.2) 


miRNA 


0 (0) 


0 (0) 


Intronic 


179 (65.33) 


114 (65.9) 


Splice site 


0 (0) 


0 (0) 


Other intronic 


179 (65.33) 


114 (65.9) 


Intergenic 


17 (6.20) 


14 (8.1) 


Structural variations 


94 (100) 


4 (100) 


Interchromosomal 


49 (52.1) 


0 (0) 


Intrachromosomal 


45 (47.9) 


4 (100) 


Deletions 


30 (31.9) 


2 (50.0) 


Inversions 


11 (11.7) 


2 (50.0) 


Duplications 


4 (4.3) 


0 (0) 


UTR = untranslated region. 


ncRNA = non-coding 


RNA. 



represented by read pairs that mapped to within 500 bp 
of a known copy nuinber polymorphism were removed 
froin the final fist of somatic chromosomal rearrange- 
ments (Figure 1). 

A total of 94 structural rearrangements were detected in 
HCC1954 including 49 interchromosomal events, 30 dele- 
tions, 11 inversions and 4 duplications (Table 2 and 
Figure 2). Of the 49 interchromosomal events, 38 
affected genie regions and 22 had already been described 
for HCC1954 (10,19). In contrast, no interchromosomal 
rearrangements were detected in the HCC1954BL genome 
and all four intrachromosomal rearrangements, including 
two deletions and two inversions, were located in 
intergenic or intronic regions (Table 2 and Figure 2). 

Variant calling and somatic point mutations 

Captured sequences (on-target and off-target) mapped to 
the reference human genome were used for variant calling 
and point mutation detection in coding regions and 
adjacent sequences. A total of 82 355 and 83 474 SNVs 
supported by at least three reads with base quality >20 
were called for the HCC1954 and HCC1954BL genomes, 
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Table 3. Single nucleotide variations identified in the HCCI954 and 
HCC1954BL genomes 





iV \ /O) 111 QUiJiNI^ 


HPf^l QS4RT 

JV ^, /0_^ 111 QUkji>J^ 


Substitutions 




83 4/4 (yj.uUJ 


Coding 


11 717 (90.92) 


12 373 (93.84) 


Intronic 


60 314 (92.53) 


61428 (93.77) 


UTR 


3419 (92.57) 


3570 (94.04) 


ncRNA 


256 (96.87) 


260 (96.92) 


Intergenic 


6649 (91.84) 


5843 (90.86) 


Indels 


689 (52.10) 


587 (52.81) 


Coding 


38 (50.00) 


31 (51.61) 


Intronic 


595 (52.43) 


506 (54.15) 


UTR 


30 (46.66) 


26 (42.30) 


ncRNA 


1 (100.00) 


1 (0.00) 


Intergenic 


25 (52.00) 


23 (39.13) 



UTR = untranslated region, ncRNA = non-coding RNA 



respectively (Table 3). As expected, most of the SNVs were 
common to both genomes and the majority (92%) of these 
inherited variants has already been described in dbSNP. 
The rate of novel variant discovery (8%) for this individ- 
ual is consistent with other pubhshed whole human 
genome sequences (20-24). 

To assess the coverage depth and sensitivity of our SNV 
calhng strategy, we compared our SNV calls to SNV calls 
extracted from genotyping array data available for both 
cell lines. (GEO: GSE12019 and GSE13373) as previously 
described (5). In total, 93.7% and 97.8% of the heterozy- 
gous array calls located within the captured regions were 
sequenced at least once for HCC1954 and HCC1954BL, 
respectively, and that 80.8% and 83.3% of the heterozy- 
gous array calls for HCC1954 and HCC1954BL, respect- 
ively, were correctly identified by our sequencing strategy 
and SNV calhng criteria. The difference in the SNV calhng 
efficiency observed between both cell fines is not statistic- 
ally significant {P = 0.69, x' = 0.16, df= 1) and together 
these results demonstrated that both genomes were suffi- 
ciently and equally covered for SNV calhng and mutation 
detection. 

To identify SNVs that are specific to each cell hne and 
hkely to correspond to somatic point mutations occurring 
in each genome, we excluded those already described as a 
known SNP in dbSNP and estabfished a set of stringent 
filtering criteria (Figure 1). Briefly, variants had to be rep- 
resented by at least three high-quality reads from one ceU 
hne and no reads from the other. Variant reads were 
required to correspond to at least 20% of the total 
number of reads covering the variant position to eliminate 
point mutations that might have eventually arisen during 
in vitro culturing and a depth of at least 5 x for the variant 
base was also required for each cefi hne to assure sufficient 
coverage in both cell Unes. Variants for HCC1954BL were 
further filtered to remove those residing in regions of LOH 
in HCC1954 (see 'Materials and Methods' section). 

A total of 274 point mutations were predicted in the 
HCC1954 genome of which 64 (23.4%) occurred in 
protein-coding regions, 14 (5.1%) in non-coding regions, 
179 (65.3%) in intronic regions and 17 (6.2%) in 
intergenic regions (Figure 2 and Table 2). Of the 64 



point mutations occurring in coding regions, 47 (73.4%) 
were predicted to cause amino acid changes 
(non-synonymous), including 45 that were missense and 
two that were nonsense. 

The same set of criteria was used to predict mutations 
present in the HCC1954BL genome. A total of 173 point 
mutations were predicted for HCC1954BL, of which 30 
(17.3%) occurred in protein-coding regions, 15 (8.7%) in 
non-coding regions, 114 (65.9%) in intronic regions and 
14 (8.iyo) in intergenic regions (Figure 2 and Table 2). Of 
the 30 point mutations occurring in coding regions, 18 
(60%) were non-synonymous, including 15 that were 
missense and 3 that were nonsense. 

Non-synonymous substitutions and mutation spectrum 

Since some of the mutations present in the HCC1954 
genome could have arisen during normal breast develop- 
ment as a result of the normal mutational process, we next 
sought to determine if the set of mutations occurring in 
the HCC1954 genome is enriched for driver mutations 
and/or occur in genes that are functionally related to the 
tumorigenesis. To address this issue, we first analyzed the 
ratio of non-synonymous to synonymous substitutions 
(dNs/dS) in the protein-coding regions of both genomes. 
The dNs/dS ratio is commonly used to estimate the degree 
of selection on non-synonymous changes, assuming that 
most synonymous mutations are biologically neutral. This 
ratio for HCC1954 is 2.8 {P = 0.37, Monte Carlo simula- 
tion) and for HCC1954BL is 1.5 (P = 0.18, Monte Carlo 
simulation), not significantly different from that expected 
by chance. However, it is notable that the difference in the 
non-synonymous/synonymous ratio between the cell fines 
is statistically significant (/• = 0.031; = 4.68; df = 1), 
indicating that non-synonymous mutations are more 
frequent in HCC1954 than in HCC1954BL. We also 
investigated the type of nucleotide changes present in 
HCC1954 and HCC1954BL genomes. Both genomes pre- 
sented a similar spectrum of nucleotide substitutions, with 
a predominance of transitions, which change a purine to 
purine (A^G) or pyrimidine to pyrimidine (Co-T) 
(Figure 3). 

Sanger validation and KEGG analysis 

To better characterize the genetic changes that drive 
tumorigenesis, we validated by capillary sequencing all 
non-synonymous point mutations present in the 
HCC1954 and HCC1954BL genomes. Of the 47 
non-synonymous mutations present in the HCC1954 
genome, 33 (70.2%) have already been described in the 
hterature and 12 out of the 14 (85.7%) novel non- 
synonymous mutations were confirmed by capillary 
sequencing (Supplementary Table SI). Of the 18 non- 
synonymous point mutations predicted for HCC1954BL, 
12 (66.6%) were also detected by capillary sequencing 
(Supplementary Table S2). Of the 45 non-synonymous 
mutations identified for HCC1954, 29 (64.4%) resuh in 
non-conservative amino acid changes, 19 (42.2%) occur 
within functional protein domains and 42 (93.3%) occur 
in evolutionary conserved amino acids residues. 
For HCC1954BL 8 (66.6%) of the 12 non-synonymous 
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Figure 3. Spectrum of nucleotide substitutions in the HCC1954 and HCC1954BL genomes. Frequency of point mutations in each of the six possible 
nucleotide substitution classes (A > C|T > G, A > G|T > C, A > T|T > A, G > A|C > T, G > C|C > G, G > T|C > A) observed in the HCC1954 (blue) 
and HCC1954BL (orange) genomes. 



Table 4. KEGG pathway analysis for genes with validated non-synonymous mutations present in the HCC1954 and 
HCC1954BL genomes 



KEGG ID 


KEGG annotation 


Number of genes 
in the pathway 


Gene Name 


/"-value 


HCC1954 










hsa05222 


Small cell lung cancer 


3 


ITGA6 


0.0003 








TP53 










TRAF2 




hsa05410 


Hypertrophic cardiomyopathy 


2 


ITGA6 


0.0167 








MYH7 




hsa04210 


Apoptosis 


2 


TP53 


0.0169 








TRAF2 




hsa05414 


Dilated cardioinyopathy 


2 


ITGA6 


0.0191 








MYH7 




hsa04010 


MAPK signaling pathway 


3 


ARRBl 


0.0237 








TP53 










TRAF2 




hsa00770 


Pantothenate and CoA biosynthesis 


1 


DPYD 


0.0325 


hsa04360 


Axon guidance 


2 


CFL2 


0.0335 








SEMA3A 




hsa04614 


Renin-angiotensin system 


1 


LNPEP 


0.0372 


hsa05200 


Pathways in cancer 


3 


ITGA6 


0.0375 








TP53 










TRAF2 




HCC1954BL 










hsa03440 


Homologous recombination 


1 


EMEl 


0.0234 


lisa00310 


Lysine degradation 


1 


SETD2 


0.0382 


hsa04740 


Olfactory transduction 


2 


OR51E2 


0.0421 








OR2D2 





mutations result in non-conservative amino acid changes, 
6 (50%) occur witliin functional protein domains and 11 
(91.66%) occur in evolutionary conserved amino acids 
residues. 

We next compared the sets of genes with validated 
non-synonymous mutations present in the two genomes. 
First, we determined whether these sets were enriched for 
specific signahng pathways related to tumorigenesis. 
According to KEGG (25), the set of mutated genes in 
HCC1954 was significantly enriched for genes related to 
apoptosis (P = 0.017, Monte Carlo Simulation), MAPK 
signahng {P = 0.023, Monte Carlo Simulation), axon 
guidance and cell migration (P = 0.033, Monte Carlo 
Simulation) and main pathways altered in cancer 



(P = 0.037, Monte Carlo Simulation). In contrast, no en- 
richment for any specific pathway related to cancer was 
observed for the set of genes mutated in HCC1954BL 
(Table 4). 

Protein-protein Interaction and functional networks 

We also examined known PPI to investigate the organiza- 
tion of mutated genes into functional networks. We first 
determined for each genome the percentage of genes with 
validated non-synonymous mutations that had at least 
one known protein interaction with any other protein. 
Similar percentages were obtained for both the 
HCC1954 (55.5%, 25/45) and HCC1954BL (66.7%, 
8/12) cell hues, indicating that there is no difference in 
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Figure 4. Protein-protein interactions networks for mutated genes in HCC1954 (A) and HCC1954BL (B). Proteins with validated non-synonymous 
mutations are represented as red circles and each line represents a confident interaction. Interaction partners with mutated genes are represented in 
green if they interact with three mutated proteins or in light blue if they interact with two mutated proteins. 



terms of representation of the mutated genes for each cell 
line in this PPI database {P = 0.729, = 0.12, df= 1). 

We then analyzed the average number of interactions 
for each mutated protein because proteins with larger 
numbers of interactions have been suggested to serve as 
essential hubs of molecular pathways (26). Proteins 
mutated in HCC1954 interact with a higher average 
number of partners than proteins mutated in 
HCC1954BL (avg 33.2 versus 5.1 proteins. Figure 4). To 
exclude the possibility that the higher degree of interaction 
observed for HCC1954 was due to the larger number of 
mutated proteins in this cell hne, a Monte Carlo simula- 
tion was performed in which 10000 random sets of 25 
proteins with known PPI were evaluated regarding their 
degree of interaction. Only 17 out of the 10000 simulated 
sets had a higher average degree than that observed for 
mutated proteins in HCC1954 {P = 0.0017, Monte Carlo 
simulation), indicating that the number of interactions 
observed for proteins mutated in HCC1954 was signifi- 
cantly different from that expected by chance. The same 
strategy showed that the average degree of interaction for 
proteins mutated in HCC1954BL was similar to that 
expected by chance {P = 0.875, Monte Carlo 
Simulation). To confirm that the differences in the 
degree of interaction observed between both ceU fines 
were not influenced by the smaller number of mutated 



proteins in HCC1954BL, a Monte Carlo simulation was 
again performed in which 1000 random sets of 5 proteins 
from HCC1954 carrying non-synonymous mutations and 
presenting known PPI were evaluated regarding the 
average number of interactions. The number obtained is 
higher than that obtained for the five mutated proteins in 
HCC1954BL (34.7 versus 5.1) and again different from 
that expected by chance {P = 0.001, Monte Carlo 
simulation). 

Interestingly, genes related to apoptosis (TPS 3, TRAF2, 
SLC25A5), MAPK signahng (TP53, ARRBl, TRAF2), 
cell adhesion {ITGA6), cytoskeleton organization 
(PC NT, CLIPl) and cell cycle iRFC4, PCNT) were 
among the HCC1954 mutated proteins displaying a 
higher number of interactions (>10 interactions, 
Figure 4). To evaluate if the higher degree of interaction 
observed for HCC1954 was just a consequence of the 
proteins mutated in this cefi line belonging to molecular 
pathways with higher connectivity, we selected all proteins 
from all KEGG pathways containing at least one mutated 
protein in HCC1954 and HCC1954BL (e.g. these sets 
include aU proteins from MAPK pathway for HCC1954 
and all proteins from Lysine degradation pathway for 
HCC1954BL) and calculated the average number of inter- 
actions for both sets of proteins. The set for HCC1954 
contained 2311 proteins with an average of 18.4 
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Table 5. Protein-protein interaction analysis for genes with non-synonymous mutations in other solid tumors 



References 


Tumor type 


Number of genes 


Number of mutated 


Average number of 


Number of mutated 


Number of 






with non-synonymous 


genes with PPI 


interactions for 


genes with common 


common 






mutations 


information (%) 


mutated genes 


partner (%) 


partners 










(P-value) 


(P-value) 


(P-value) 


Pleasance et al. (8) 


Lung 


90 


50 (56) 


11.6 (0.2692) 


33 (66) (0.0001) 


42 (0.0870) 


Pleasance et al. (7) 


Melanoma 


188 


100 (53) 


8.3 (0.8344) 


69 (69) (0.0001) 


103 (0.3130) 


Ding et al. (4) 


Breast basal 


29 


17 (59) 


8.1 (0.2210) 


7 (41) (0.0001) 


7 (0.0132) 


Shah et al. (9) 


Breast lobular 


32 


16 (50) 


32.5 (0.0034) 


7 (44) (0.0001) 


28 (0.0011) 


Clark et al. (3) 


GBM 


110 


40 (36) 


12.9 (0.7269) 


18 (45) (0.0001) 


13 (0.1896) 


Galante et al. 


Breast HCC1954 


45 


25 (56) 


33.2 (0.0017) 


17 (68) (0.0001) 


64 (0.0001) 


(this study) 















interactions per protein. The set for HCC1954BL con- 
tained 395 proteins with an average of 22.1 interactions 
per protein. These values are not significantly different 
from each other (P = 0.0921, z-test = 1.16875; 
df= 503.76), indicating that the higher degree of inter- 
action observed for HCC1954 is not a direct consequence 
of the mutated proteins belonging to pathways with higher 
connectivity. 

We also looked for common interaction partners among 
the mutated proteins because mutated proteins with 
common partners would probably have synergistic 
tumor-promoting functions (27). Approximately 68% 
(17/25) of the mutated proteins in HCC1954 shared at 
least one common partner, as opposed to none of the 
mutated proteins in HCC1954BL (Figure 4). Simulations 
to correct for the difference in the number of mutated 
proteins in both genomes, revealed that the number of 
mutated proteins sharing a common partner was signifi- 
cantly different from that expected by chance for 
HCC1954 (P< 0.0001, Monte Carlo simulation) but not 
for HCC1954BL {P = 0.855, Monte Carlo simulation). 
Again, to confirm that the differences in the number of 
common interaction partners observed between both cell 
lines were not influenced by the smaller number of 
mutated proteins in HCC1954BL, a Monte Carlo simula- 
tion was again performed in which 1000 random sets of 5 
proteins from HCC1954 carrying non-synonymous muta- 
tions and presenting known PPI were evaluated regarding 
the average number of common interaction partners. The 
number obtained is higher than that obtained for the five 
mutated proteins in HCC1954BL (3.3 versus 0) and again 
different from that expected by chance {P = 0.0245, 
Monte Carlo simulation). 

A total of 64 common partners were identified for 
proteins mutated in HCC1954, of which 51, 10 and 3 
interact with 2, 3 and 4 mutated proteins in HCC1954, 
respectively. Again, the number of common partners 
observed for HCC1954 was significantly different from 
that expected by chance (P< 0.0001, Monte Carlo simu- 
lation). Key cancer genes such as BRCAl, CDC42, 
CHECKl, MDM2, MAP3K1/3 and SMAD2/3 were 
among the 64 common interaction partners (Figure 4). 

Finally, we also investigated the organization of 
mutated proteins into functional networks in other 
tumor genomes recently sequenced. Similar patterns of 
synergy were observed for the set of genes carrying 



non-synonymous point mutations in melanoma, gfioblast- 
oma, lung and breast-tumor genomes (Table 5). Although 
the average number of interactions for mutated proteins 
varied significantly between the different tumors analyzed 
(8.1-32.5 interactions), the percentage of mutated genes 
with common partners was similar among all tumors 
analyzed (varying from 41% to 66%) and different from 
that expected by chance (Table 5). Among the five tumors 
analyzed, the estrogen receptor-positive metastatic lobular 
breast cancer was the one presenting the most similar 
functional organization to HCC1954 with an average of 
32.5 interactions for each mutated protein, 44% of the 
mutated proteins with common partners and 28 
common partners among the mutated proteins. All these 
values are significantly different from that expected by 
chance (P = 0.0034, P = 0.0001 and P = 0.0013, respect- 
ively, Monte Carlo Simulations). 



DISCUSSION 

In this study, we were able to identify, for the first time to 
our knowledge, the somatically acquired genetic alter- 
ations present in the genome of a lymphoblastoid and a 
tumor ceU derived from the same individual. By also 
characterizing the somatic mutations present in the 
lymphoblastoid genome, we were able to show how a 
non-tumor somatic tissue evolves over the same timescale 
and under similar environmental conditions when 
compared to a tumor genome, providing important 
insights into the normal mutational processes and into 
the functional implications of the accumulation of 
somatic alterations in a tumor and a matching 
lymphoblastoid genome. 

A highly complex pattern of chromosomal rearrange- 
ments was exclusively observed in the tumor genome with 
most of these rearrangements affecting genie regions. In 
agreement with these observations, in a previous study in 
which exome and transcriptome data from HCC1954 and 
HCC1954BL were combined to identify genes with LOH 
and allele-specific expression (ASE), large LOH regions, 
harboring genes with ASE and known tumor suppressor 
characteristics, were only detected in the tumor genome 
(28). Together these observations support the concept that 
chromosomal instability is a key feature of human cancer 
and a driving force of tumorigenesis (29). 
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Interestingly, the number of somatically acquired point 
mutations and the spectrum of nucleotide substitutions 
found in the lymphoblastoid genome were comparable 
to that present in the tumor genome. It has been 
proposed that normal point mutation rates are insufficient 
to account for all point mutations observed in tumors and 
that tumor cells must acquire a mutator phenotype, which 
increases the accumulation of point mutations in the 
tumor genome (30). The number of point mutations 
present in both genomes analyzed is in agreement with 
the estimated spontaneous mutation rate of normal 
human cells (~2.10-10/bp per cell division) (31) and the 
difference in the total number of point mutations observed 
for the HCC1954 and HCC1954BL genomes (274/ 
173 = 1.58) is probably due to the higher DNA content 
of the tumor cell line rather than to the existence of a 
mutator phenotype (30). Both cell hues also presented a 
similar spectrum of nucleotide substitutions with a pre- 
dominance of transitions. A similar frequency of 
G>A|C>T transitions was observed for other recently 
reported breast-tumor genomes (4,9) and a comparison to 
the spectrum of nucleotide substitutions reported for a 
melanoma and a lung cancer cell hne indicates that 
neither genome under study had signs of the influence of 
external mutagens such as tobacco or ultraviolet hght 
(7,8). Together these results indicate that the influence of 
endogenous mutagens and replication errors are sufficient 
to generate the overall number of point mutations 
required to drive tumorigenesis and that tumor cells do 
not necessarily need to acquire a mutator phenotype to 
increase the accumulation of point mutations in their 
genomes. 

Although, we cannot completely discard the possibility 
that some of the somatic point mutations detected in 
HCC1954 and HCC1954BL genomes resuh from in vitro 
culturing and EBV transformation (in the case of 
HCC1954BL), we do not expect these mutations to be 
representative or to influence our observations and 
overall conclusions. Both cell fines received similar treat- 
ments in terms of the timing of establisfiment and in vitro 
propagation and have not been maintained in culture for a 
long period (36 passages), reducing the probabifity of 
introducing mutations during the culturing process. We 
have also set up very stringent criteria for somatic point 
mutation detection to filter most, if not aU, non-clonal 
mutations eventually introduced during in vitro culturing 
of these cell fines (see 'Materials and Methods' section). 
Moreover, it has been recently demonstrated that clonal 
point mutations rarely arise during in vitro or in vivo ex- 
perimental growth of tumor cells (32). Jones et al. fiave 
analyzed 289 mutations present in 18 cell fines or xeno- 
grafts, eacfi derived from a different primary tumor. 
Over 99% of these mutations were also present in the 
primary tumors (29). Finally, several studies have 
demonstrated a very strong correlation when DNA from 
EBV-transformed lymphoblastoid cell fines was compared 
to DNA from the corresponding blood samples (33), 
indicating that EBV transformation does not substantial- 
ly alters the mutation rate and genetic stabifity of trans- 
formed cells. Indeed EBV-transformed lympfioblastoid 
cell lines have been widely used as source of DNA in 



genetic screening studies and have also been used as 
source of normal DNA in whole genome studies on the 
occurrence of somatic mutations in tumor genomes (7,8). 

We also do not expect our observations and conclusions 
to be significantly affected by sequencing errors. Since one 
of our criteria for point mutation detection was to have at 
least 20% of the reads reporting the mutated allele, this 
represents an average of four reads considering the 
average exome coverage of 22x. If we use a false-positive 
error rate of ~2% for each position per read for the 454 
sequencing platform (34), our error rate per position 
would be 16 per 100 Mb. Since the exome captured 
region comprises ~31 Mb, the expected number of false 
positive mutations per genome is between 4 and 5 muta- 
tions. Nevertheless, it is important to emphasize that we 
took a conservative approach and used for afi downstream 
analysis (KEGG and PPI) only tfiose mutations that were 
further vafidated by Sanger sequencing. 

Significant functional differences were observed 
between the set of genes mutated in both cefi lines, 
indicating that mutations in the tumor genome are not 
randomly distributed. We showed that non-synonymous 
point mutations are more frequently found in the tumor 
genome and that they preferentially affect hub-genes in 
molecular pathways related to tumorigenesis. Moreover, 
by looking for common interaction partners among 
mutated proteins, we demonstrated, for the the first 
time, that mutations in the tumor genome are co-selected 
to present synergistic tumor-promoting functions. This 
observation was further extended to other individual 
tumor genomes sequenced. Similar patterns of synergy 
were observed for the set of genes carrying 
non-synonymous point mutations in melanoma, gfioblast- 
oma, lung and breast-tumor genomes. 

Tfie functional differences observed between tfie sets of 
genes mutated in the lymphoblastoid and tumor genomes 
raise questions regarding the number and strength of the 
driver genetic alterations required for tumorigenesis. If a 
tumor cell were to require just a smafi number of 'strong' 
driver alterations, we would not expect to see the striking 
functional association of the genes mutated in the tumor 
because the majority of the mutations would be passen- 
gers. Our results thus support the model in which the 
tumor genome has a few 'strong' driver mutations and 
several 'weak' driver mutations that act in a synergistic 
fashion to disrupt molecular pathways related to tumori- 
genesis (35). Although, this model has been proposed in 
the literature for some time, to our knowledge, tfiis is the 
first time the existence of strong and weak driver muta- 
tions is evidenced by comparing genome-wide mutation 
data generated from tumor and non-tumor tissues 
derived from the same individual. We would expect 
these 'weak' drivers to be infrequently mutated and insuf- 
ficient to form tumors in the absence of the strong drivers. 
Distinguishing 'weak' drivers from passenger mutations 
will require a systems biology approach to dissect the in- 
dividual roles of mutated genes and examine their inter- 
actions within the networks and pathways related to 
tumorigenesis. Moreover, this approach will require tfie 
analysis of large numbers of normal genome sequences 
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to identify the permutations of mutated genes that do not 
resuh in tumorigenesis. 
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